36 research outputs found

    Simpler is better: a novel genetic algorithm to induce compact multi-label chain classifiers

    Multi-label classification (MLC) is the task of assigning multiple class labels to an object based on the features that describe the object. One of the most effective MLC methods is known as Classifier Chains (CC). This approach consists of training q binary classifiers linked in a chain, y1 → y2 → ... → yq, with each responsible for classifying a specific label in {l1, l2, ..., lq}. The chaining mechanism allows each individual classifier to incorporate the predictions of the previous ones as additional information at classification time. Thus, possible correlations among labels can be automatically exploited. Nevertheless, CC suffers from two important drawbacks: (i) the label ordering is decided at random, although it usually has a strong effect on predictive accuracy; (ii) all labels are inserted into the chain, although some of them might carry information that is irrelevant for discriminating the others. In this paper we tackle both problems at once by proposing a novel genetic algorithm capable of searching for a single optimized label ordering while also considering the use of partial chains. Experiments on benchmark datasets demonstrate that our approach is able to produce models that are both simpler and more accurate.
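
    The chaining mechanism can be illustrated with a minimal sketch. This is not the paper's genetic algorithm: it only shows how a fixed label ordering determines what each link in the chain sees. The class name `TinyClassifierChain`, the 1-nearest-neighbour base learner, and the toy data are all assumptions made for illustration.

```python
def nearest_neighbour(train_rows, train_targets, row):
    """Stand-in base learner (an assumption): target of the closest training row."""
    def squared_distance(other):
        return sum((a - b) ** 2 for a, b in zip(other, row))
    best = min(range(len(train_rows)), key=lambda i: squared_distance(train_rows[i]))
    return train_targets[best]


class TinyClassifierChain:
    """One binary classifier per label, linked in the given label order."""

    def __init__(self, order):
        self.order = list(order)
        self.links = []

    def fit(self, X, Y):
        self.links = []
        for position, label in enumerate(self.order):
            earlier = self.order[:position]
            # Training input = original features + true values of earlier labels.
            rows = [list(x) + [y[e] for e in earlier] for x, y in zip(X, Y)]
            targets = [y[label] for y in Y]
            self.links.append((rows, targets))
        return self

    def predict(self, x):
        predicted = {}
        for position, label in enumerate(self.order):
            earlier = self.order[:position]
            # Prediction input = original features + predictions of earlier links.
            row = list(x) + [predicted[e] for e in earlier]
            rows, targets = self.links[position]
            predicted[label] = nearest_neighbour(rows, targets, row)
        # Report predictions in label-index order, whatever the chain order was.
        return [predicted[label] for label in sorted(predicted)]


# Toy data (made up): label 0 copies feature 0, label 1 is feature 0 XOR feature 1.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
Y = [[0, 0], [0, 1], [1, 1], [1, 0]]

chain_a = TinyClassifierChain(order=[0, 1]).fit(X, Y)
chain_b = TinyClassifierChain(order=[1, 0]).fit(X, Y)
prediction_a = chain_a.predict([1, 0])
prediction_b = chain_b.predict([1, 0])
```

    A search procedure such as a genetic algorithm would evaluate many candidate values of `order` and keep the ordering that performs best on validation data; on this toy data both orderings agree, which will not generally hold on real datasets.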

    Recruiting from the network: discovering Twitter users who can help combat Zika epidemics

    Tropical diseases like Chikungunya and Zika have come to prominence in recent years as the cause of serious, long-lasting, population-wide health problems. In large countries like Brazil, traditional disease prevention programs led by health authorities have not been particularly effective. We explore the hypothesis that monitoring and analysis of social media content streams may effectively complement such efforts. Specifically, we aim to identify selected members of the public who are likely to be receptive to virus combat initiatives that are organised in local communities. Focusing on Twitter and on the topic of Zika, our approach involves (i) training a classifier to select topic-relevant tweets from the Twitter feed, and (ii) discovering the top users who are actively posting relevant content about the topic. We may then recommend these users as the prime candidates for direct engagement within their community. In this short paper we describe our analytical approach and prototype architecture, discuss the challenges of dealing with noisy and sparse signals, and present encouraging preliminary results.
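
    The two-step pipeline, (i) filter relevant tweets and (ii) rank users by relevant posts, can be sketched as follows. The keyword test in `is_relevant` is only a placeholder for the trained classifier the abstract mentions, and the function names, user names, and tweets are invented for illustration.

```python
from collections import Counter


def is_relevant(text):
    """Placeholder for the trained relevance classifier (an assumption)."""
    return "zika" in text.lower()


def top_users(tweets, k):
    """Rank users by how many of their tweets pass the relevance filter."""
    counts = Counter(user for user, text in tweets if is_relevant(text))
    return counts.most_common(k)


# Made-up (user, text) pairs.
tweets = [
    ("ana", "Zika cases reported near the river"),
    ("ana", "Found standing water, possible Zika mosquito site"),
    ("ana", "Community clean-up against Zika on Saturday"),
    ("bia", "Zika prevention tips from the clinic"),
    ("bia", "More Zika information at the health post"),
    ("bia", "Great football match today"),
    ("caio", "Traffic is terrible this morning"),
]

ranking = top_users(tweets, k=2)
```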

    A Novel Feature Selection Method for Uncertain Features: An Application to the Prediction of Pro-/Anti- Longevity Genes

    Understanding the ageing process is a very challenging problem for biologists. To help in this task, there has been a growing use of classification methods (from machine learning) to learn models that predict whether a gene influences the process of ageing or promotes longevity. One type of predictive feature often used for learning such classification models is Protein-Protein Interaction (PPI) features. One important property of PPI features is their uncertainty, i.e., a given feature (PPI annotation) is often associated with a confidence score, which is usually ignored by conventional classification methods. Hence, we propose the Lazy Feature Selection for Uncertain Features (LFSUF) method, which is tailored for coping with the uncertainty in PPI confidence scores. In addition, following the lazy learning paradigm, LFSUF selects features for each instance to be classified, making the feature selection process more flexible. We show that our LFSUF method achieves better predictive accuracy when compared to other feature selection methods that either do not explicitly take PPI confidence scores into account or deal with uncertainty globally rather than using a per-instance approach. Also, we interpret the results of the classification process using the features selected by LFSUF, showing that the number of selected features is significantly reduced, assisting the interpretability of the results. The datasets used in the experiments and the program code of the LFSUF method are freely available on the web at http://github.com/pablonsilva/FSforUncertainFeatureSpaces

    A survey of genetic algorithms for multi-label classification

    In recent years, multi-label classification (MLC) has become an emerging research topic in big data analytics and machine learning. In this problem, each object of a dataset may belong to multiple class labels and the goal is to learn a classification model that can infer the correct labels of new, previously unseen, objects. This paper presents a survey of genetic algorithms (GAs) designed for MLC tasks. The study is organized in three parts. First, we propose a new taxonomy focused on GAs for MLC. In the second part, we provide an up-to-date overview of the work in this area, categorizing the approaches identified in the literature with respect to the taxonomy. In the third and last part, we discuss some new ideas for combining GAs with MLC

    DMC-GRASP: A continuous GRASP hybridized with data mining

    The hybridization of metaheuristics with data mining techniques has been successfully applied to combinatorial optimization problems. Examples of this type of strategy are DM-GRASP and MDM-GRASP, hybrid versions of the Greedy Randomized Adaptive Search Procedure (GRASP) metaheuristic, which incorporate data mining techniques. This type of hybrid method is called Data-Driven Metaheuristics and aims at extracting useful knowledge from the data generated by metaheuristics in their search process. Despite success in combinatorial problems like the set packing problem and maximum diversity problem, proposals of this type to solve continuous optimization problems are still scarce in the literature. This work presents a data mining hybrid version of C-GRASP, an adaptation of GRASP for problems with continuous variables. We call this new version DMC-GRASP, which identifies patterns in high-quality solutions and generates new solutions guided by these patterns. We performed computational experiments with DMC-GRASP on a set of well-known mathematical benchmark functions, and the results showed that metaheuristics for continuous optimization could also benefit from using patterns to guide the search for better solutions

    Um GRASP Híbrido com Reconexão por Caminhos e Mineração de Dados [A Hybrid GRASP with Path-Relinking and Data Mining]

    The exploration of hybrid metaheuristics, which combine metaheuristics with concepts and processes from other areas, has been an important line of research in combinatorial optimization. In this work, we propose a hybrid version of the GRASP metaheuristic that incorporates the path-relinking technique and a data mining module. Computational experiments showed that combining path-relinking with data mining helped GRASP find better solutions in less computational time. Another contribution of this work is the application of this hybrid proposal to the 2-path network design problem, for which it found better solutions.

    Mineração de Exceções Aplicada aos Sistemas para Detecção de Intrusões [Exception Mining Applied to Intrusion Detection Systems]

    Intrusion detection systems for computer networks frequently use rule-based models to recognize suspicious patterns in traffic data. This work presents a technique based on exception mining that can be used to increase the efficiency of this type of system. Exceptions are association rules that become extremely strong (positive exceptions) or extremely weak (negative exceptions) in subsets of a database that satisfy specific conditions on selected attributes. We present the results obtained by applying this technique to the KDDCup99 dataset, which records information about network connections.
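
    The core idea, an association rule whose strength changes sharply inside a subset of the data, can be illustrated by comparing the confidence of a rule over all records with its confidence over a conditioned subset. This is a minimal sketch, not the paper's mining algorithm: the helper names, the field names, and the records are hypothetical and are not taken from KDDCup99, and no threshold for "extremely strong" or "extremely weak" is assumed.

```python
def matches(record, condition):
    """True when the record has every attribute value listed in the condition."""
    return all(record.get(key) == value for key, value in condition.items())


def confidence(records, antecedent, consequent):
    """Confidence of antecedent -> consequent; None if the antecedent never holds."""
    covered = [r for r in records if matches(r, antecedent)]
    if not covered:
        return None
    hits = sum(1 for r in covered if matches(r, consequent))
    return hits / len(covered)


# Hypothetical connection records (illustrative only).
records = [
    {"protocol": "tcp", "service": "http", "label": "normal"},
    {"protocol": "tcp", "service": "http", "label": "normal"},
    {"protocol": "tcp", "service": "http", "label": "normal"},
    {"protocol": "tcp", "service": "telnet", "label": "attack"},
    {"protocol": "tcp", "service": "telnet", "label": "attack"},
    {"protocol": "udp", "service": "dns", "label": "normal"},
]

rule_if = {"protocol": "tcp"}
rule_then = {"label": "normal"}

# Confidence over the whole database.
overall = confidence(records, rule_if, rule_then)

# Confidence inside subsets defined by a condition on a selected attribute.
telnet_only = [r for r in records if matches(r, {"service": "telnet"})]
http_only = [r for r in records if matches(r, {"service": "http"})]
in_telnet = confidence(telnet_only, rule_if, rule_then)
in_http = confidence(http_only, rule_if, rule_then)
```

    Here the rule is moderately reliable overall, becomes much weaker among the telnet records (a candidate negative exception), and much stronger among the http records (a candidate positive exception).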

    Prioritizing positive feature values: a new hierarchical feature selection method

    In this work, we address the problem of feature selection for the classification task in hierarchical and sparse feature spaces, which characterise many real-world applications nowadays. A binary feature space is deemed hierarchical when its binary features are related via generalization-specialization relationships, and is considered sparse when, in general, the instances contain far fewer “positive” than “negative” feature values. In any given instance, a feature value is deemed positive (negative) when the property associated with the feature has been (has not been) observed for that instance. Although there are many methods for the traditional feature selection problem in the literature, the proper treatment of hierarchical feature structures is still a challenge. Hence, we introduce a novel hierarchical feature selection method that follows the lazy learning paradigm – selecting a feature subset tailored for each instance in the test set. Our strategy prioritizes the selection of features with positive values, since they tend to be more informative – the presence of a relatively rare property is usually more relevant information than the absence of that property. Experiments on different application domains have shown that the proposed method outperforms previous hierarchical feature selection methods and also traditional methods in terms of predictive accuracy, while generally selecting smaller feature subsets.

    Is P-value<0.05 Enough? Two Case Studies in Classifiers Evaluation

    A common tool for comparing classifiers is statistical significance analysis, performed through hypothesis tests. However, many researchers attempt to establish statistical significance through a blind evaluation of the p-value<0.05 condition, ignoring important concepts such as effect size and statistical power. This work highlights possible problems caused by the misuse of hypothesis tests and shows how effect size and statistical power can inform better decision making. To that end, two case studies applying Student’s t-test and the Wilcoxon signed-rank test to the comparison of two classifiers are presented.
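
    Why the p-value alone can mislead is easier to see with a small computation. The sketch below computes the paired t statistic together with the paired-samples effect size (Cohen's d_z, the mean difference divided by the standard deviation of the differences). The function name and the accuracy values are made up for illustration and are not taken from the paper's case studies.

```python
import math
import statistics


def paired_t_and_effect_size(scores_a, scores_b):
    """Return (t statistic, Cohen's d_z) for two classifiers scored on the same folds."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired comparison needs equally long score lists")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    d_z = mean_diff / sd_diff
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    return t_stat, d_z


# Made-up per-fold accuracies for two classifiers.
acc_a = [0.81, 0.79, 0.84, 0.80, 0.83]
acc_b = [0.80, 0.80, 0.82, 0.79, 0.82]

t_few, d_few = paired_t_and_effect_size(acc_a, acc_b)

# Same per-fold differences, four times as many folds.
t_many, d_many = paired_t_and_effect_size(acc_a * 4, acc_b * 4)
```

    Because t equals d_z multiplied by the square root of n, adding more folds with the same small differences inflates the t statistic (and so shrinks the p-value) while the effect size stays almost unchanged. Reporting both makes it clear whether a significant result is also a practically meaningful one.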